Crime vs. High School Educational Outcomes – Montgomery County, MD

Nick Schaap

CMSC320 | Spring 2022

Background

Growing up in Montgomery County, MD, I have first-hand knowledge of some of the biggest issues faced by my county's school district. MCPS is the 14th-largest public school district in the US by enrollment, and it is extremely diverse: its 165,000+ students come from more than 150 countries and speak more than 150 languages (https://www.montgomeryschoolsmd.org/about/). One of the biggest issues faced by MCPS is the achievement gap. Currently, there are large discrepancies in educational outcomes across MCPS's 26 high school clusters, which can largely be attributed to external factors such as household income and neighborhood environment. I wanted to examine specifically the correlation between crime and key school profile metrics such as graduation rate, attendance rate, dropout rate, free and reduced-price meals enrollment (FARMS), and student-to-staff ratio, and to see whether MCPS is providing more staff support in underprivileged areas.

In [ ]:
import folium
from folium.plugins import HeatMap
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from fuzzywuzzy import fuzz
import geopy.distance

Data Collection

I wanted to collect graduation rates and other profile information for each high school, as well as crime data. The crime data dates back to 2013, and the MCPS website's "Schools at a Glance" information was available through 2021, so I retrieved data from both sources for 2013-2021 and will focus my analysis on this time frame, since it is when data from both sources overlaps. The school data is tabulated in PDF form, so I copied and pasted it into an Excel sheet and exported it as a CSV, then used pandas to wrangle the data and parse the values for each school and year. I also used an open data API to retrieve each school's latitude and longitude so that I could filter for crimes near a particular school. I matched the API data to the data retrieved from the MCPS website using a library called fuzzywuzzy: for each MCPS school name, I scored the API rows by how closely their school name matched and took the closest match. This accounted for any small discrepancies in how school names were formatted.
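The matching idea can be sketched with a small example. Here I use Python's built-in difflib to compute a similarity ratio, which is analogous (though not identical) to fuzzywuzzy's partial_ratio; the school-name strings below are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1]; fuzzywuzzy scales a similar measure to 0-100
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

api_names = ["Thomas S Wootton HS", "Walt Whitman HS", "Wheaton HS"]
mcps_name = "Thomas S. Wootton HS"  # note the extra period
best_match = max(api_names, key=lambda name: similarity(mcps_name, name))
print(best_match)  # → Thomas S Wootton HS
```

Taking the highest-scoring candidate tolerates punctuation and abbreviation differences that an exact string comparison would reject.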

MCPS At-A-Glance Data with index matching to Montgomery County School Information API data

In [ ]:
# School profile information for year 2013-2021

# Read in the csv file
school_data = pd.read_csv('school_data.csv')

# Melt column data into individual row entries
school_data = school_data.melt(var_name="Year", value_name="Value")
for (i, row) in school_data.iterrows():
    values = row['Value'].split()
    columns = ['FARMS', 'ATTENDANCE', 'GRADUATION', 'DROPOUT', 'MOBILITY', 'STUD/STAFF RATIO', 'ENG CLASS SIZE', 'OTHER CLASS SIZE']
    # Parsing individual metric values
    school_data.at[i, 'High School'] = " ".join(values[0:(len(values) - len(columns))])
    for j in range(0, len(columns)):
        school_data.at[i, columns[-(j+1)]] = values[-(j+1)]
school_data = school_data.drop(columns=['Value'])

# Strip '≤'/'≥' qualifiers. Some of the profile information is intentionally masked at 
# the boundaries for student privacy reasons.
def containsNumber(value):
    for character in value:
        if character.isdigit():
            return True
    return False

# Convert each metric to a float, stripping any leading qualifier character,
# and drop rows whose value could not be parsed (marked -1)
for col in ['GRADUATION', 'FARMS', 'DROPOUT', 'ATTENDANCE', 'STUD/STAFF RATIO']:
    school_data[col] = school_data[col].apply(lambda x: float(x) if x[0].isdigit() else float(x[1:]) if containsNumber(x) else -1)
    school_data = school_data[school_data[col] > 0]


# School geographical information
schools_info = pd.read_json('https://data.montgomerycountymd.gov/resource/7ycz-azby.json')

for (i, row) in school_data.iterrows():
    school_name = row["High School"]
    find_school = schools_info.copy(deep=True)
    # Finding the closest match to the API data using fuzzywuzzy partial_ratio on the school name
    find_school['match_score'] = find_school.apply(lambda x: fuzz.partial_ratio(school_name, x["school_name"]), axis=1)
    # Sort by similarity score and take the index of the best match
    best_match_index = find_school.sort_values(by='match_score', ascending=False).head(1).index.values[0]
    school_data.at[i, "school_info_index"] = best_match_index
school_data = school_data.astype({'school_info_index':'int'})
school_data
Out[ ]:
Year High School OTHER CLASS SIZE ENG CLASS SIZE STUD/STAFF RATIO MOBILITY DROPOUT GRADUATION ATTENDANCE FARMS school_info_index
0 2021 Bethesda-Chevy Chase HS 17.6 14.6 15.1 7.7 5.0 92.7 93.4 21.7 1
1 2021 Montgomery Blair HS 21.1 20.2 13.3 10.0 8.2 88.2 90.5 51.2 20
2 2021 James Hubert Blake HS 19.0 16.1 13.4 9.7 5.0 92.1 92.3 56.3 23
3 2021 Winston Churchill HS 22.9 19.4 12.3 ≤5.0 5.0 95.0 93.8 8.4 7
4 2021 Clarksburg HS 16.7 17.9 13.2 10.7 5.0 93.2 93.0 44.8 11
... ... ... ... ... ... ... ... ... ... ... ...
229 2013 Springbrook HS 25.9 25.6 12.1 13.9 8.9 83.3 93.1 65.6 22
230 2013 Watkins Mill HS 23.8 24.2 10.5 18.6 7.6 83.5 91.5 71.7 17
231 2013 Wheaton HS 23.4 22.6 9.8 17.9 13.4 74.1 92.4 79.9 24
232 2013 Walt Whitman HS 25.7 25.0 12.5 8.3 5.0 95.0 94.8 5.0 2
233 2013 Thomas S. Wootton HS 27.5 25.8 14.1 5.1 5.0 95.0 95.0 11.4 4

225 rows × 11 columns

Montgomery County High Schools Information API Data

In [ ]:
# Filtering the API columns down to those of interest
schools_info = schools_info[['school_name', 'zip_code', 'city', 'address', 'latitude', 'longitude']]
schools_info
Out[ ]:
school_name zip_code city address latitude longitude
0 Walter Johnson HS 20814 Bethesda 6400 Rock Spring Dr 39.025392 -77.130102
1 Bethesda-Chevy Chase HS 20814 Bethesda 4301 East West Hwy 38.986826 -77.088970
2 Walt Whitman HS 20817 Bethesda 7100 Whittier Blv 38.981631 -77.127673
3 Poolesville HS 20837 Poolesville 17501 Willard Rd 39.143103 -77.418780
4 Thomas S Wootton HS 20850 Rockville 2100 Wootton Pkw 39.076582 -77.183197
5 Rockville HS 20851 Rockville 2100 Baltimore Rd 39.086348 -77.118272
6 Richard Montgomery HS 20852 Rockville 250 Richard Montgomery Dr 39.077292 -77.145730
7 Winston Churchill HS 20854 Potomac 11300 Gainsborough Rd 39.044305 -77.173128
8 Col Zadok Magruder HS 20855 Rockville 5939 Muncaster Mill Rd 39.131311 -77.118806
9 Sherwood HS 20860 Sandy Spring 300 Olney Sandy Spring Rd 39.148342 -77.018772
10 Paint Branch HS 20866 Burtonsville 14121 Old Columbia Pik 39.088679 -76.947102
11 Clarksburg HS 20871 Clarksburg 22500 Wims Rd 39.225502 -77.265587
12 Damascus HS 20872 Damascus 25921 Ridge Rd 39.282496 -77.210020
13 Northwest HS 20874 Germantown 13501 Richter Farm Rd 39.151593 -77.279329
14 Seneca Valley HS 20874 Germantown 19401 Crystal Rock Dr 39.175094 -77.264332
15 Gaithersburg HS 20877 Gaithersburg 314 S Frederick Ave 39.134839 -77.195478
16 Quince Orchard HS 20878 Gaithersburg 15800 Quince Orchard Rd 39.115933 -77.254239
17 Watkins Mill HS 20879 Gaithersburg 10301 Apple Ridge Rd 39.183967 -77.215836
18 Albert Einstein HS 20895 Kensington 11135 Newport Mill Rd 39.039616 -77.067036
19 Northwood HS 20901 Silver Spring 919 University Blv W 39.035694 -77.022484
20 Montgomery Blair HS 20901 Silver Spring 51 E University Blv 39.018273 -77.012434
21 John F Kennedy HS 20902 Silver Spring 1901 Randolph Rd 39.065750 -77.039028
22 Springbrook HS 20904 Silver Spring 201 Valley Brook Dr 39.057802 -77.005681
23 James Hubert Blake HS 20905 Silver Spring 300 Norwood Rd 39.113330 -77.017506
24 Wheaton HS 20906 Silver Spring 12601 Dalewood Dr 39.061338 -77.066669
In [ ]:
# Lookup function that returns the API data row corresponding to a particular school
def get_school_info(school_name):
    return schools_info[schools_info["school_name"] == school_name].head(1).reset_index().iloc[0,:]

Montgomery County Crime Data (2013-2021)

I obtained crime data from the Montgomery County, Maryland open data archive. This data dates back to 2013 and contains over 294,000 crime records. The API limits responses to 1,000 records by default, so I passed query parameters to override this limit and to order the records by the start_date of the logged crime event. This let me avoid writing my own pagination logic to retrieve the full dataset, at the cost of a large response time (~30 seconds).
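Had the limit override not been available, a pagination loop over the API's $limit/$offset query parameters would have been needed instead. A minimal sketch of that approach, where the page-fetching callable stands in for the actual HTTP request:

```python
import pandas as pd

def fetch_all(fetch_page, page_size=50000):
    """Collect every record by requesting successive pages until one comes back empty.

    fetch_page(limit, offset) stands in for a call like
    pd.read_json(f"{url}?$limit={limit}&$offset={offset}&$order=start_date").
    """
    frames, offset = [], 0
    while True:
        page = fetch_page(page_size, offset)
        if page.empty:
            break
        frames.append(page)
        offset += page_size
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Demonstration against an in-memory "dataset" instead of the live API
data = pd.DataFrame({"incident_id": range(7)})
fake_fetch = lambda limit, offset: data.iloc[offset:offset + limit]
print(len(fetch_all(fake_fetch, page_size=3)))  # → 7
```

Ordering by start_date matters for pagination: without a stable sort order, records can shift between pages and be duplicated or skipped.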

In [ ]:
crime_data = pd.read_json("https://data.montgomerycountymd.gov/resource/icn6-v9z3.json?$limit=300000&$order=start_date")
crime_data
Out[ ]:
incident_id offence_code case_number date nibrs_code victims crimename1 crimename2 crimename3 district ... address_street street_type start_date latitude longitude police_district_number geolocation end_date street_prefix_dir street_suffix_dir
0 201353560 9199 210046695 2021-11-17 16:35:01 90Z 1 Other All Other Offenses POLICE INFORMATION GERMANTOWN ... GREAT PARK CIR 2016-07-01T00:00:00.000 39.202690 -77.254900 5D {'latitude': '39.2027', 'longitude': '-77.2549... NaN NaN NaN
1 201089117 9105 16035632 NaT 90Z 1 Other All Other Offenses LOST PROPERTY WHEATON ... CUTSTONE WAY 2016-07-01T00:00:00.000 39.096388 -77.028240 4D {'latitude': '39.0964', 'longitude': '-77.0282... NaN NaN NaN
2 201103232 9105 16053356 NaT 90Z 1 Other All Other Offenses LOST PROPERTY SILVER SPRING ... NEW HAMPSHIRE AVE 2016-07-01T00:00:00.000 39.033464 -76.986127 3D {'latitude': '39.0335', 'longitude': '-76.9861... 2016-08-31T00:00:00.000 NaN NaN
3 201102727 9105 16052672 NaT 90Z 1 Other All Other Offenses LOST PROPERTY MONTGOMERY VILLAGE ... ANTARES DR 2016-07-01T00:00:00.000 39.160745 -77.145064 6D {'latitude': '39.1607', 'longitude': '-77.1451... NaN NaN NaN
4 201087611 9108 16033838 NaT 90Z 1 Other All Other Offenses RECOVERED PROPERTY - MONT CO. SILVER SPRING ... SOUTHAMPTON DR 2016-07-01T00:00:00.000 39.006138 -76.983357 3D {'latitude': '39.0061', 'longitude': '-76.9834... 2016-07-01T00:00:00.000 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
294469 201376063 9105 220020410 2022-05-13 16:09:48 90Z 1 Other All Other Offenses LOST PROPERTY MONTGOMERY VILLAGE ... MIDTOWN MEW 2022-05-13T16:09:00.000 0.000000 0.000000 6D {'latitude': '0.0', 'longitude': '0.0', 'human... NaN NaN NaN
294470 201376077 9199 220020428 2022-05-13 16:13:13 90Z 1 Other All Other Offenses POLICE INFORMATION ROCKVILLE ... VEIRS MILL RD 2022-05-13T16:10:00.000 39.075510 -77.114600 1D {'latitude': '39.0755', 'longitude': '-77.1146... 2022-05-13T18:04:00.000 NaN NaN
294471 201376064 9101 220020421 2022-05-13 16:34:26 90Z 1 Other All Other Offenses SUDDEN DEATH WHEATON ... HEWITT AVE 2022-05-13T16:34:00.000 39.079650 -77.063400 4D {'latitude': '39.0797', 'longitude': '-77.0634... NaN NaN NaN
294472 201376061 9199 220020431 NaT 90Z 1 Other All Other Offenses POLICE INFORMATION ROCKVILLE ... BOULDERCREST CT 2022-05-13T17:57:00.000 39.083630 -77.181600 1D {'latitude': '39.0836', 'longitude': '-77.1816... 2022-05-13T15:57:00.000 NaN NaN
294473 201376094 1399 220020473 NaT 13B 1 Crime Against Person Simple Assault ASSAULT - 2ND DEGREE MONTGOMERY VILLAGE ... GAITHER RD 2022-05-13T21:55:00.000 39.117420 -77.183700 6D {'latitude': '39.1174', 'longitude': '-77.1837... NaN NaN NaN

294474 rows × 30 columns

The Montgomery County Open Data Crime API returns various interesting information about each recorded crime including: nibrs_code, crimename1, crimename2, crimename3, start_date, latitude, and longitude. I plan to use the latitude and longitude information to filter crime events near various schools. The nibrs_code is another interesting feature that allows me to filter the crime data for specific categories of offenses as defined by the National Incident-Based Reporting System standards.
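Filtering on these columns amounts to a simple boolean mask in pandas. A toy example, with invented rows mimicking the API schema (the codes 90F, 13B, and 90Z are drawn from the output above):

```python
import pandas as pd

# Toy rows mimicking the API schema; the values are illustrative only
toy = pd.DataFrame({
    "nibrs_code": ["90Z", "13B", "90F", "90F"],
    "crimename1": ["Other", "Crime Against Person", "Other", "Other"],
})
family_offenses = toy[toy["nibrs_code"] == "90F"]
against_people = toy[toy["crimename1"].str.contains("Person", na=False)]
print(len(family_offenses), len(against_people))  # → 2 1
```

Passing na=False to str.contains treats missing values as non-matches rather than raising on NaN entries.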

Visualizing Discrepancies in Crime Levels Surrounding Various High Schools

Mapping Crime Data and School Locations

In [ ]:
# Function that uses the folium library to generate a heat map of the various crimes passed 
# as well as labeling the location of each of MCPS high school
def generate_heat_map(crimes, zoom=11, center=[39.1547, -77.2405], include=None):
    map_osm = folium.Map(location=center, zoom_start=zoom, tiles = "Stamen Toner")
    for _, school in schools_info.iterrows():
        if include is not None and school["school_name"] not in include:
            continue
        folium.Marker(location=[school["latitude"], school["longitude"]], tooltip=school["school_name"],
                        icon=folium.Icon(color='red')).add_to(map_osm)
    heat_data = [[row['latitude'],row['longitude']] for _, row in crimes.iterrows()]
    HeatMap(heat_data, radius=15).add_to(map_osm)
    return map_osm

All Crimes

In [ ]:
generate_heat_map(crime_data)
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

The above heat map shows crime incidents along with the locations of MCPS's high schools. No interesting patterns jump out immediately; crime appears fairly uniformly distributed around each high school. I wanted to take a closer look, specifically at crimes against people, and to zoom in for a more detailed heat map.

Mapping Crimes Against People

In [ ]:
crimes_against_people = crime_data[crime_data['crimename1'].str.contains("Person", na=False)]
generate_heat_map(crimes_against_people, zoom=13)
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

This map was more revealing. Now I could clearly see that some high schools, such as Wheaton High School, had relatively high levels of crimes against people, while others, such as Walt Whitman High School, saw relatively low levels.

Crimes against People in the Area Surrounding Wheaton High School

In [ ]:
center = list(get_school_info("Wheaton HS")[["latitude", "longitude"]])
generate_heat_map(crimes_against_people, zoom=13, center=center, include=["Wheaton HS"])
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Crimes against People in the Area Surrounding Walt Whitman High School

In [ ]:
center = list(get_school_info("Walt Whitman HS")[["latitude", "longitude"]])
generate_heat_map(crimes_against_people, zoom=13, center=center, include=["Walt Whitman HS"])
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Visualizing School Profile Information

Now that I have visualized the discrepancy in crime surrounding different high schools around the county, I want to begin connecting the dots and see how crime may be affecting key educational outcomes and school profile details.

In [ ]:
for school in school_data['High School'].unique():
    # Sort the rows by year so the x and y values stay aligned
    school_info = school_data[school_data['High School'] == school].sort_values(by='Year')
    plt.plot(school_info['Year'], school_info["GRADUATION"], label=school)
plt.title('Graduation rate vs. time')
plt.xlabel('Year')
plt.ylabel('Graduation rate')
plt.show()

# TODO: ADD PLOTS FOR OTHER PROFILE METRICS

Visualizing School Profile Metrics vs. Nearby Crime Rates

I hypothesize that an increase in crime in the area surrounding a high school (within a 5-mile radius) will have a negative impact on all of the school profile metrics under study (graduation rate, attendance rate, dropout rate, and free and reduced-price meals enrollment (FARMS)). To test my hypothesis, I needed a way to gather all of the crimes in each school's surrounding area. To accomplish this, I created a function that filters the crime data retrieved from the Montgomery County Open Data API down to only the crimes within a 5-mile radius of the given school.

I used a library called geopy to calculate the distance between two points defined by their latitude and longitude. Geopy computes the geodesic distance between the two points, which accounts for the curvature of the earth. This made finding the distance between schools and crime incidents trivial, since both of my datasets include latitude and longitude.
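As a sanity check on the distances involved, the great-circle (haversine) distance can be computed in pure Python. It assumes a spherical earth, so it differs slightly from geopy's geodesic result. The coordinates below are the Wheaton HS and Walt Whitman HS values from the API table above:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean earth radius

def haversine_miles(p1, p2):
    # Great-circle distance between two (latitude, longitude) points, in miles
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

wheaton = (39.061338, -77.066669)
whitman = (38.981631, -77.127673)
print(round(haversine_miles(wheaton, whitman), 1))  # roughly 6 miles apart
```

At this scale the spherical approximation and geopy's geodesic computation agree to within a small fraction of a mile, so either would serve for a 5-mile cutoff.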

In [ ]:
def get_crimes_near_school(school_name, nibrs_codes = None, mile_radius = 5):
    # Allow filtering to select only crimes conforming to certain nibrs codes
    if nibrs_codes is not None:
        crimes = crime_data[crime_data['nibrs_code'].str.contains("|".join(nibrs_codes), na=False)]
    else:
        crimes = crime_data
    school = get_school_info(school_name)
    # Extract school latitude and longitude information
    location = tuple(school[["latitude", "longitude"]])
    crimes = crimes.copy(deep=True)
    for (i, row) in crimes.iterrows():
        crime_location = tuple(row[["latitude", "longitude"]])
        # Mark a crime as near the school if it occurred within mile_radius miles
        crimes.at[i, "near-school"] = geopy.distance.distance(location, crime_location).miles <= mile_radius
    # Return all crimes marked as near the passed in school
    return crimes[crimes["near-school"] == True] 

For some initial plotting, I took the raw time-based school metric data and averaged it over the entire collection period, giving a characteristic value for each school. I also added a column with the number of crimes committed within a 5-mile radius of each school. To narrow the set of crimes, I focused specifically on crimes marked as family offenses (NIBRS code 90F), which include child neglect and domestic violence. These crimes are likely to have the most profound impact on educational outcomes, as they often involve minors.
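The averaging step amounts to a groupby-mean in pandas; a toy example with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    "High School": ["A HS", "A HS", "B HS", "B HS"],
    "Year": [2020, 2021, 2020, 2021],
    "GRADUATION": [90.0, 92.0, 80.0, 84.0],
})
# Collapse the per-year rows into one characteristic value per school
avg = toy.groupby("High School")["GRADUATION"].mean().reset_index()
print(avg)
```

reset_index turns the group keys back into ordinary columns, which keeps later row-wise lookups straightforward.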

In [ ]:
# Collecting a description of all crimes listed as nibrs code 90F (family offenses) in the dataset
crime_data[crime_data['nibrs_code'] == "90F"]["crimename3"].unique()
Out[ ]:
array(['FAMILY OFFENSE - NEGLECT CHILD (INCLUDES NONSUPPOR',
       'FAMILY OFFENSE (DESCRIBE OFFENSE)',
       'FAMILY OFFENSE - NEGLECT FAMILY',
       'FAMILY OFFENSE - CRUELTY TOWARD CHILD',
       'FAMILY OFFENSE - CRUELTY TOWARD WIFE'], dtype=object)
In [ ]:
# Grouping the school data by High school and the school info index and taking the mean of any columns containing floating point data
average_school_rates = school_data.groupby(by=["High School", "school_info_index"]).mean().reset_index()

for (i, row) in average_school_rates.iterrows():
    # Get the appropriate school name used by the school info dataframe
    school_name = schools_info.iloc[row["school_info_index"]]["school_name"]
    # Get crimes near the school
    crimes = get_crimes_near_school(school_name, nibrs_codes=['90F'])
    # Count the number of crimes
    numCrimes = crimes["incident_id"].count()
    average_school_rates.at[i, "numCrimes"] = numCrimes

average_school_rates   
Out[ ]:
High School school_info_index STUD/STAFF RATIO DROPOUT GRADUATION ATTENDANCE FARMS numCrimes
0 Albert Einstein HS 18 10.866667 10.722222 81.800000 90.255556 65.366667 290.0
1 Bethesda-Chevy Chase HS 1 14.755556 5.000000 94.488889 93.477778 20.600000 171.0
2 Clarksburg HS 11 13.122222 5.255556 92.044444 93.211111 49.744444 149.0
3 Col. Zadok Magruder HS 8 11.333333 7.000000 86.833333 90.911111 55.366667 219.0
4 Damascus HS 12 11.888889 5.000000 91.111111 92.855556 30.388889 21.0
5 Gaithersburg HS 15 10.411111 14.822222 75.377778 87.833333 72.355556 261.0
6 James Hubert Blake HS 23 13.100000 5.222222 91.177778 93.022222 57.288889 150.0
7 John F. Kennedy HS 21 10.255556 9.844444 81.888889 89.622222 81.900000 246.0
8 Montgomery Blair HS 20 13.400000 8.122222 86.555556 91.233333 53.833333 213.0
9 Northwest HS 13 13.355556 5.100000 94.177778 92.577778 43.822222 233.0
10 Northwood HS 19 10.900000 14.444444 80.800000 88.755556 73.244444 277.0
11 Paint Branch HS 10 12.411111 7.855556 88.800000 94.544444 64.477778 77.0
12 Poolesville HS 3 14.188889 5.000000 95.000000 94.400000 14.977778 3.0
13 Quince Orchard HS 16 12.300000 5.100000 91.433333 91.766667 41.733333 224.0
14 Richard Montgomery HS 6 14.100000 5.100000 91.555556 93.011111 37.555556 201.0
15 Rockville HS 5 9.777778 6.144444 85.311111 92.688889 51.433333 229.0
16 Seneca Valley HS 14 10.011111 7.588889 83.766667 90.600000 65.911111 257.0
17 Sherwood HS 9 12.888889 5.000000 91.055556 93.188889 28.455556 58.0
18 Springbrook HS 22 11.366667 9.488889 84.477778 91.944444 71.466667 265.0
19 Thomas S. Wootton HS 4 14.000000 5.000000 95.000000 94.911111 13.200000 164.0
20 Walt Whitman HS 2 12.711111 5.000000 95.000000 94.188889 5.011111 69.0
21 Walter Johnson HS 0 13.766667 5.000000 93.944444 94.011111 18.388889 166.0
22 Watkins Mill HS 17 10.133333 11.077778 80.811111 88.477778 79.088889 287.0
23 Wheaton HS 24 11.933333 14.455556 77.922222 91.322222 76.511111 244.0
24 Winston Churchill HS 7 12.788889 5.000000 95.000000 93.700000 9.322222 108.0

I created a helper function that plots a given school profile metric against the number of nearby crimes, making any relationship between the two visible. I also used numpy to add a linear regression line to the resulting plots.
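np.polyfit with degree 1 performs an ordinary least-squares line fit, returning the slope and intercept, and np.poly1d wraps them into a callable line. A quick check on exactly linear toy data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0  # exactly linear, so the fit should recover slope 2, intercept 1
slope, intercept = np.polyfit(x, y, 1)
line = np.poly1d([slope, intercept])
print(round(slope, 6), round(intercept, 6))  # → 2.0 1.0
```

Evaluating line(x) at the observed x values then gives the points of the regression line to overlay on the scatter plot.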

In [ ]:
def plot_school_rates_vs_crime(rate, ax, label_min=None):
    x = average_school_rates["numCrimes"]
    y = average_school_rates[rate]
    schools = average_school_rates["High School"]
    # Fit a degree-1 (linear) least-squares regression line
    coef = np.polyfit(x, y, 1)
    poly1d_fn = np.poly1d(coef)
    ax.plot(x, y, 'yo', x, poly1d_fn(x), '--k')
    for i, label in enumerate(schools):
        # Optionally annotate only schools whose metric falls below label_min
        if label_min is not None and y[i] < label_min:
            ax.annotate(label, (x[i], y[i]))
    ax.set_title(f'Average {rate.lower()} rate vs. number of family crimes (2013-2021)')
    ax.set_xlabel('Number of family-related crimes (2013-2021)')
    ax.set_ylabel(f'Average {rate.lower()} rate')

# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1)
plot_school_rates_vs_crime("FARMS", ax2)
plot_school_rates_vs_crime("ATTENDANCE", ax3)
plot_school_rates_vs_crime("DROPOUT", ax4)
for ax in axs:
    ax.label_outer()